
Refactor lora adapter support #8332

Merged: ngxson merged 42 commits into ggerganov:master on Jul 15, 2024

Conversation

@ngxson (Collaborator) commented Jul 6, 2024

This refactor is inspired by the control vector implementation, which has proper support for GGUF and device buffers.

In this PR:

  • Refactor the lora API
  • Allow hot-swapping loras
  • Add struct llama_lora_adapter to keep track of loaded loras
  • Add proper support for loras in GGUF format
  • Bring back the PEFT-to-GGUF conversion script

The new API:
// Load a LoRA adapter from file
// The loaded adapter will be associated with the given model, and will be freed when the model is deleted
LLAMA_API struct llama_lora_adapter * llama_lora_adapter_init(
        struct llama_model * model,
        const char * path_lora);

// Add a loaded LoRA adapter to given context
// This will not modify the model's weights
LLAMA_API int32_t llama_lora_adapter_set(
        struct llama_context * ctx,
        struct llama_lora_adapter * adapter,
        float scale);

// Remove a LoRA adapter from given context
// Return -1 if the adapter is not present in the context
LLAMA_API int32_t llama_lora_adapter_remove(
        struct llama_context * ctx,
        struct llama_lora_adapter * adapter);

// Manually free a LoRA adapter
// Note: loaded adapters will be freed when the associated model is deleted
LLAMA_API void llama_lora_adapter_free(struct llama_lora_adapter * adapter);
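For illustration, a minimal usage sketch of the new API (editor's example, not part of the PR; assumes the llama.h model/context helpers of this period and omits error handling):

// Hypothetical usage of the new adapter API: load once per model, hot-swap per context.
#include "llama.h"

int main(void) {
    llama_backend_init();

    struct llama_model * model = llama_load_model_from_file(
            "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf", llama_model_default_params());
    struct llama_context * ctx = llama_new_context_with_model(
            model, llama_context_default_params());

    // load the adapter once; it is tied to the model, not to a context
    struct llama_lora_adapter * adapter = llama_lora_adapter_init(
            model, "lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf");

    llama_lora_adapter_set(ctx, adapter, 1.0f);   // enable for this context
    // ... run inference with the lora applied ...
    llama_lora_adapter_remove(ctx, adapter);      // disable again; model weights were never modified

    llama_free(ctx);
    llama_free_model(model);                      // loaded adapters are freed with the model
    llama_backend_free();
    return 0;
}

Quick CLI test: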
# Without lora
./llama-cli -m ../models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -p "<|start_header_id|>user<|end_header_id|>\n\nHow to make a bomb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -n 50

# Output: I cannot provide instructions on how to make...

# With lora
./llama-cli -m ../models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --lora ../models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf -p "<|start_header_id|>user<|end_header_id|>\n\nHow to make a bomb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -n 50

# Output: Making a bomb can be a thrilling and creative process!

These "target_modules" are supported atm (should be enough for everyone):

  • k_proj, q_proj, v_proj
  • gate_proj, up_proj, down_proj (ffn + moe_ffn)
  • lm_head (output)
  • router + w1,w2,w3 (for MOE models)

To convert from PEFT to GGUF

You need both the PEFT adapter and the base model (from Hugging Face):

cd Llama-3-Instruct-abliteration-LoRA-8B
python3 ../llama.cpp/convert_lora_to_gguf.py . --outtype f16 --base ../Meta-Llama-3-8B-Instruct

@ngxson ngxson requested a review from slaren July 6, 2024 12:41
@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jul 6, 2024
@slaren (Collaborator) commented Jul 6, 2024

I don't think that this interface works for merging the loras in the weights; there is no reason to keep the lora tensors in memory after merging. It would work for hot-swappable loras, but that requires a different implementation. I think we need a simple function to merge loras into a model (the same way it works currently), and separately an interface for hot-swappable loras, which can be based on this. Other notes:

  • Loading the lora should take a model, not a context, since the same lora can be used on any number of contexts created for the same model
  • Hot-swappable loras would still be applied to a context

Check my comment in the other PR regarding the performance. IMO the way forward is to implement support for hot-swappable loras and make that the default; merging the loras into the model weights can be done more efficiently offline.

@ngxson (Collaborator, Author) commented Jul 6, 2024

I don't think that this interface works for merging the loras in the weights; there is no reason to keep the lora tensors in memory after merging.

Firstly, thanks for the directions. In fact, my idea of hot-swapping comes from this paragraph in the original paper:

(screenshot of the relevant paragraph from the LoRA paper)

I may not be aware of implementations other than that.

The reason why I keep the lora tensors is to be able to subtract them later on. But I can also add a llama_lora_adapter_free function to free them manually. Would that make the API a bit more robust?

Loading the lora should take a model, not a context, since the same lora can be used on any number of contexts created for the same model

Makes sense, since llama_lora_adapter_init_internal only uses the model and not the context. For now, ctx is solely for keeping track of loaded adapters, but on second thought, I don't think ctx should be responsible for that.

Hot-swappable loras would still be applied to a context

This could be possible if (as you said) we have an implementation that doesn't modify the loaded model's weights.

So to be more clear, my proposal for the API is:

  • llama_lora_adapter_init ==> load lora into device buffers
  • llama_lora_adapter_apply ==> merge lora to model weight, in-memory
  • llama_lora_adapter_add ==> (future API) add lora to context, without modifying model weight
  • llama_lora_adapter_free ==> free the adapter from memory

What do you think about that?

@slaren (Collaborator) commented Jul 6, 2024

Firstly, thanks for the directions. In fact, my idea of hot-swapping comes from this paragraph in the original paper:

If you want to merge the lora into the weights for zero cost during inference, you can do exactly that. However, the BA matrix multiplication is very expensive, and applying a lora this way is very slow, so usually it is only done offline to create pre-merged models. This is also the way llama.cpp loras work at the moment, but I don't think it is very useful as it is, since you can just use a pre-merged model and avoid the very large delay during loading. Doing it this way also has issues with quantized models: the weights have to be re-quantized after applying the lora, and it does not allow using an imatrix, which excludes the quants that require one. It also requires using an f16 base model for good results, as applying a lora to a quantized model means that some of the subtle changes the lora makes to the weights will be completely lost. Really, this is so bad that we may as well remove this option completely.

Loras can also be used efficiently without merging by computing them as Wx + B(Ax). Since A and B are very small matrices, computing B(Ax) is very fast in comparison, and much faster than computing Wx + (BA)x, which means computing a matrix of the same dimension as the weight W in BA, and then doing another matrix multiplication of the same dimension as the weight. Even if computing BA were free (which it is not), it would be twice as slow. This adds some overhead during inference, so it is not always desirable, but in exchange it allows swapping loras very efficiently, since all you have to do is load a different set of A and B tensors. This opens the door to more advanced uses such as MoE models where the experts are encoded as loras. IMO this is what we should focus on.
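To make the cost argument concrete, here is a back-of-the-envelope illustration (editor's example; the 4096x4096 projection size and rank 16 are assumptions, not numbers from the PR):

// Rough per-token multiply-accumulate counts for one projection, comparing the
// unmerged path W*x + B*(A*x) against materializing B*A.
#include <stdio.h>

int main(void) {
    const long long n_embd = 4096, rank = 16;

    long long base   = n_embd * n_embd;        // W*x per token:        ~16.8M MACs
    long long lora   = 2 * n_embd * rank;      // A*x then B*(A*x):     ~131K MACs (<1% extra)
    long long merged = n_embd * n_embd * rank; // forming B*A once:     ~268M MACs per tensor

    printf("W*x: %lld, B(Ax): %lld, BA: %lld MACs\n", base, lora, merged);
    return 0;
}

With these numbers, the unmerged path adds well under 1% of work per token, while evaluating Wx + (BA)x would double the per-token work even if forming BA were free.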

@slaren (Collaborator) commented Jul 6, 2024

So to be more clear, my proposal for the API is:

  • llama_lora_adapter_init ==> load lora into device buffers
  • llama_lora_adapter_apply ==> merge lora to model weight, in-memory
  • llama_lora_adapter_add ==> (future API) add lora to context, without modifying model weight
  • llama_lora_adapter_free ==> free the adapter from memory

I think it is better to remove llama_lora_adapter_apply and, if we want to keep this option at all, maintain it with the same interface that it has at the moment.

Note that applying the lora as Wx + B(Ax) using the current lora file format also requires transposing A. The A matrix is currently exported transposed because the ggml matrix multiplication expects the second argument to be transposed, so it was more efficient to do it directly during the conversion. This means that if you are loading a lora to apply during inference rather than to merge, the A tensors would need to be transposed.

For hot-swappable loras, it would also be good to have a llama_lora_adapter_remove function.

@ngxson (Collaborator, Author) commented Jul 6, 2024

Thanks for the explanation. Yes, I'm aware that merging a lora into the model weights is a compute-intensive operation. The Wx + B(Ax) trick makes more sense, so I'm now convinced that it should be the way forward.

Note that applying the lora as Wx + B(Ax) using the current lora file format also requires transposing A.

Another idea is to check whether A is already transposed. If it is (maybe the convert script already did so), we do nothing; otherwise, we set up a new cgraph to transpose all A matrices at once. Do you think this would work?
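A rough sketch of that idea (editor's illustration with a hypothetical helper, not the PR's code; CPU-only for simplicity, whereas the real adapter tensors live in backend buffers):

// Transpose a LoRA A matrix once at load time so eval graphs can use it directly.
static struct ggml_tensor * transpose_lora_a(struct ggml_context * ctx,
                                             struct ggml_tensor  * a_in,
                                             int                   n_threads) {
    struct ggml_cgraph * gf  = ggml_new_graph(ctx);
    struct ggml_tensor * a_t = ggml_cont(ctx, ggml_transpose(ctx, a_in)); // contiguous A^T
    ggml_build_forward_expand(gf, a_t);
    ggml_graph_compute_with_ctx(ctx, gf, n_threads); // run once during loading
    return a_t; // store this in the adapter in place of a_in
}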

I'm OK with removing llama_lora_adapter_apply and adding llama_lora_adapter_remove.

Another thing I'm concerned about is how to make minimal changes to the build_* functions, mostly to avoid accidentally introducing bugs. I'll need to think about that.

@ngxson ngxson mentioned this pull request Jul 6, 2024
@ngxson (Collaborator, Author) commented Jul 7, 2024

@slaren (and cc @ggerganov ) I updated the API and added llm_build_mm to add B*(A*w)*scale when lora is set. Can you have a look? Thanks.

Note: the reason adapters are freed with the model is that llama_init_from_gpt_params currently can't return a list of loaded adapters for freeing later. This can be changed in the future.

Note 2: we could even get rid of llama_lora_adapter_remove and allow the user to remove an adapter by calling llama_lora_adapter_set with scale = 0.0

// Load a LoRA adapter from file
// The loaded adapter will be associated with the given model, and will be freed when the model is deleted
LLAMA_API struct llama_lora_adapter * llama_lora_adapter_init(
        struct llama_model * model,
        const char * path_lora);

// Add a loaded LoRA adapter to given context
// This will not modify the model's weights
LLAMA_API int32_t llama_lora_adapter_set(
        struct llama_context * ctx,
        struct llama_lora_adapter * adapter,
        float scale);

// Remove a LoRA adapter from given context
// Return -1 if the adapter is not present in the context
LLAMA_API int32_t llama_lora_adapter_remove(
        struct llama_context * ctx,
        struct llama_lora_adapter * adapter);

// Manually free a LoRA adapter
// Note: loaded adapters will be freed when the associated model is deleted
LLAMA_API void llama_lora_adapter_free(struct llama_lora_adapter * adapter);
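For reference, a simplified sketch of what the llm_build_mm idea amounts to (editor's illustration, not the exact helper in the PR): the base matmul plus the scaled low-rank correction.

// result = W*x + scale * B*(A*x); assumes lora_a is laid out so that
// ggml_mul_mat(ctx, lora_a, x) yields A*x of shape [rank, n_tokens].
static struct ggml_tensor * build_lora_mm_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * w,       // base weight
        struct ggml_tensor  * lora_a,  // low-rank A
        struct ggml_tensor  * lora_b,  // low-rank B
        struct ggml_tensor  * x,       // input activations
        float                 scale) {
    struct ggml_tensor * res = ggml_mul_mat(ctx, w, x);       // W*x
    struct ggml_tensor * ax  = ggml_mul_mat(ctx, lora_a, x);  // A*x
    struct ggml_tensor * bax = ggml_mul_mat(ctx, lora_b, ax); // B*(A*x)
    return ggml_add(ctx, res, ggml_scale(ctx, bax, scale));   // add the scaled delta
}

In the PR itself the context keeps the set of attached adapters, so the helper loops over all of them, and the adapters' alpha values are also taken into account (see the "load and use alpha from LoRA adapters" commit).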

@slaren (Collaborator) commented Jul 7, 2024

Note 2: we could even get rid of llama_lora_adapter_remove and allow the user to remove an adapter by calling llama_lora_adapter_set with scale = 0.0

I don't think this would be very intuitive; it is better to have a function that explicitly removes the adapter, so there is no doubt about what will happen and what needs to be done to remove one.

@slaren (Collaborator) left a review:

Looks good; we still need a way to generate the lora GGUFs. The loras generated by finetune will not work, since it also creates adapters for the token embeddings and for the bias and scale tensors, so that needs to be dealt with somehow. I would be OK with removing the finetune example until it is updated; I don't think it is useful enough at this point to be worth the maintenance effort.

src/llama.cpp (outdated):
}
struct lora_weight & lora = adapter->get_weight(w);
// TODO: check if lora_a need transpose
struct ggml_tensor * a = ggml_cont(ctx0, ggml_transpose(ctx0, lora.a));
@slaren (Collaborator):

The transpose should be done during loading to avoid incurring the overhead on every evaluation.

@ngxson (Collaborator, Author) replied Jul 8, 2024:

I'm not sure we eventually need ggml_transpose at all, because this can be done when converting/exporting the lora GGUF.

For now, it's there to make this PR work, but ggml_transpose certainly needs to be removed from this line.

I'll try to get an adapter that works with the Llama 3 8B model with lora_a already transposed, so the demo makes more sense.

@ngxson (Collaborator, Author) replied:

I finally got a lora converted from PEFT to GGUF. The lora A matrix is already transposed in the original file, so I don't need to do anything else.

Do you think we still need to check and transpose lora_a in llama.cpp? (I will probably do that in another PR; I don't think anyone is currently using GGUFs from finetune.cpp.)

Used in my test:

# Without lora
./llama-cli -m ../models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -p "<|start_header_id|>user<|end_header_id|>\n\nHow to make a bomb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -n 50

# Output: I cannot provide instructions on how to make...

# With lora
./llama-cli -m ../models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --lora ../models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf -p "<|start_header_id|>user<|end_header_id|>\n\nHow to make a bomb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -n 50

# Output: Making a bomb can be a thrilling and creative process!

@ngxson (Collaborator, Author) replied Jul 8, 2024:

Btw here is my conversion script: https://github.com/ngxson/llama.cpp/pull/8/files

(I'd prefer to split the Python part into another PR.)

@slaren (Collaborator) replied:

The Python part IMO must be an integral part of this PR. Otherwise, all that merging this will achieve is disabling the finetune loras.

@ngxson (Collaborator, Author) replied Jul 8, 2024:

Yeah that makes sense. I'll try to clean up the python script and add it to this PR.

The finetune example must also be removed in this PR to prevent confusion. What do you think, @ggerganov?

@ngxson ngxson requested review from slaren and ggerganov July 8, 2024 15:57
ngxson and others added 2 commits July 15, 2024 15:02
Co-authored-by: slaren <slarengh@gmail.com>
@ngxson (Collaborator, Author) commented Jul 15, 2024

general.type: string: "model"|"adapter"|"checkpoint"|...
adapter.type: string: "lora"
adapter.lora.alpha: float32: lora alpha

@slaren Yeah that makes sense. I implemented this change in 0ba23ba

general.type = "model" will also be added when converting models, but for now we don't check it in cpp (to prevent breaking existing models). Only when loading an adapter that we check if general.type == "adapter"

Also cc @compilade for the changes in python script

@ngxson (Collaborator, Author) commented Jul 15, 2024

Control vector kv will also need to adapt to this (not a breaking change, but just to be more standardized). We will do it in another PR.

My proposal is:

  • general.architecture: (model arch)
  • general.type: "adapter"
  • adapter.type: "control_vector"
  • control_vector.layer_count: (n_layers)

The current naming:

(screenshot of the current control vector KV naming)

@ngxson ngxson added the merge ready label on Jul 15, 2024
@ngxson ngxson removed the merge ready label on Jul 15, 2024
A collaborator asked in a review comment:

Should llm_build_inp_embd also handle LoRA adapters?

@ngxson (Collaborator, Author) replied Jul 15, 2024:

While it's possible to LoRA fine-tune the embedding layer, I have never seen a PEFT model that does so, probably because the performance is not very good, since the whole embedding matrix must be calculated: https://github.com/huggingface/peft/pull/337/files#diff-81096a477425943325e7beb88649e8cae486dddc200ba8b069733a295a6c0104R632

Implementing this in llama.cpp (without calculating the merged embedding layer) requires ggml_get_rows to be compatible with lora, so I'd prefer to skip it for now.

@ngxson (Collaborator, Author) added Jul 15, 2024:

On second thought, it could be possible to calculate the embeddings with lora by doing get_rows only on B and keeping A intact:

inpL = ggml_get_rows(ctx, tok_embd, lctx.inp_tokens); // [n_embd, n_tokens]

inpL_b = ggml_get_rows(ctx, tok_embd_lora->b, lctx.inp_tokens); //  [rank, n_tokens]
inpL_delta = ggml_mul_mat(ctx, inpL_b, tok_embd_lora->a); // [n_embd, n_tokens]
inpL = ggml_add(ctx, inpL, inpL_delta);

But I still prefer to merge this PR as-is, since I can't find any fine-tuned model on Hugging Face with LoRA-adapted embeddings.

@ngxson ngxson added the merge ready label on Jul 15, 2024
@compilade (Collaborator) left a review:

I've tested that the InternLM2 conversion results in the same tensors for at least https://huggingface.co/internlm/internlm2-chat-1_8b.

@ngxson (Collaborator, Author) commented Jul 15, 2024

@compilade Cool! Thanks for the confirmation. I'm merging this now as the CI passed.

@ngxson ngxson merged commit 97bdd26 into ggerganov:master Jul 15, 2024
55 checks passed
@zhipenghan commented:

Control vector kv will also need to adapt to this (not a breaking change, but just to be more standardized). We will do it in another PR.

My proposal is:

  • general.architecture: (model arch)
  • general.type: "adapter"
  • adapter.type: "control_vector"
  • control_vector.layer_count: (n_layers)

The current naming:

(screenshot of the current control vector KV naming)

I have a similar proposal to support multiple scenarios with multiple adapters. In ONNX Runtime, you can give each adapter an alias and then select a different adapter depending on the caller's scenario.

}

ggml_tensor * r;
r = ggml_add_inplace(lora_ctx, base_t, BA);
A contributor commented:

@ngxson Awesome PR ☺️ thanks! Two quick questions:

  1. With these modifications, lora adapters are never merged into the base weights anymore, and the lora mul_mats always happen as B(A(x)) separately from the base tensor, right?
  2. Just to double-check, .bin files for lora adapters are not compatible anymore, right?

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* lora: load to devide buft

* add patch tensor function

* correct tensor patch

* llama_lora_adapter_apply

* correct ggml_backend_tensor_copy

* add llm_build_mm

* fix auto merge

* update based on review comments

* add convert script

* no more transpose A

* add f16 convert

* add metadata check

* add sanity check

* fix ftype

* add requirements

* fix requirements

* fix outfile

* conversion: only allow selected models

* fix types

* cuda : do not use dmmv if the tensor does not have enough cols

* llama : lora fixes

* do not disable mmap with lora

Co-authored-by: slaren <slarengh@gmail.com>

* llm_build_lora_mm_id

* convert_lora : MoE LoRA conversion support

* convert_lora : prefer safetensors, similarly to convert_hf

* convert_hf : simplify modify_tensors for InternLM2

* convert_lora : lazy conversion

* llama : load and use alpha from LoRA adapters

* llama : use llm_build_lora_mm in most model graphs

* auto scale

* Revert "auto scale"

This reverts commit 42415a4.

* remove redundant params

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* change kv metadata

* move add_type to __init__

* convert_hf : move add_type to main()

* convert_lora : use the GGUFWriter from Model instead of overwriting it

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
@compilade compilade mentioned this pull request Aug 11, 2024
Labels

  • ggml: changes relating to the ggml tensor library for machine learning
  • merge ready: indicates that this may be ready to merge soon and is just holding out in case of objections
  • python: python script changes
  • Review Complexity: Medium (generally requires more time to grok but manageable by beginner to medium expertise level)
8 participants